Titanic Data Exploration

By Agboola Quam.

Preliminary Wrangling

This document explores the Titanic dataset, which contains demographics and passenger information for 891 of the 2,224 passengers and crew who were on board the Titanic.

Questions

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

What factors made people more likely to survive?

  • Was socio-economic standing a factor in survival rate?
  • Did age, regardless of sex, determine your chances of survival?
  • Did more women survive than men?
  • Did passengers who were alone have a higher survival rate than passengers with family on board?

Assumption: We are going to assume that everyone who survived made it to a lifeboat, and that survival wasn't purely a matter of chance or luck.

In [143]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import os 
import gc 
import math
import sklearn
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
In [2]:
# load in the dataset into a pandas dataframe, print statistics
titanic = pd.read_csv('titanic.csv')
titanic.sample(10)
Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
293 294 0 3 Haas, Miss. Aloisia female 24.0 0 0 349236 8.8500 NaN S
660 661 1 1 Frauenthal, Dr. Henry William male 50.0 2 0 PC 17611 133.6500 NaN S
791 792 0 2 Gaskell, Mr. Alfred male 16.0 0 0 239865 26.0000 NaN S
118 119 0 1 Baxter, Mr. Quigg Edmond male 24.0 0 1 PC 17558 247.5208 B58 B60 C
51 52 0 3 Nosworthy, Mr. Richard Cater male 21.0 0 0 A/4. 39886 7.8000 NaN S
46 47 0 3 Lennon, Mr. Denis male NaN 1 0 370371 15.5000 NaN Q
121 122 0 3 Moore, Mr. Leonard Charles male NaN 0 0 A4. 54510 8.0500 NaN S
279 280 1 3 Abbott, Mrs. Stanton (Rosa Hunt) female 35.0 1 1 C.A. 2673 20.2500 NaN S
277 278 0 2 Parkes, Mr. Francis "Frank" male NaN 0 0 239853 0.0000 NaN S
778 779 0 3 Kilgannon, Mr. Thomas J male NaN 0 0 36865 7.7375 NaN Q
In [3]:
#1 Checking the dimension of the titanic dataframe
titanic.shape
Out[3]:
(891, 12)

We have 891 rows and 12 columns in our data frame

In [4]:
#2 Checking the data structure and whether there are missing values in any rows of the data frame
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [5]:
#missing data
total = titanic.isnull().sum().sort_values(ascending=False)
percent = (titanic.isnull().sum()/titanic.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data
Out[5]:
Total Percent
Cabin 687 0.771044
Age 177 0.198653
Embarked 2 0.002245
Fare 0 0.000000
Ticket 0 0.000000
Parch 0 0.000000
SibSp 0 0.000000
Sex 0 0.000000
Name 0 0.000000
Pclass 0 0.000000
Survived 0 0.000000
PassengerId 0 0.000000

Let's analyse this to understand how to handle the missing data.

We'll treat any variable with more than 15% missing data as a candidate for deletion rather than imputation; in those cases we won't try any trick to fill the missing data. By that rule, Cabin (77% missing) should be dropped. Will we miss this data? Probably not: cabin assignments are mostly unrecorded and hard to relate directly to survival (perhaps that is why the data is missing). Age (at about 20% missing) also exceeds the threshold, but it is central to our questions, so we will fill it rather than drop it.
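This 15% rule can be expressed directly in pandas. A minimal sketch on a toy frame standing in for the raw data (the values below are invented for illustration, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for the raw titanic data: Cabin is mostly missing.
df = pd.DataFrame({
    'Cabin': ['B58', None, None, None, None],
    'Age':   [24.0, 50.0, 16.0, 21.0, 35.0],
    'Fare':  [8.85, 133.65, 26.0, 7.8, 15.5],
})

# isnull().mean() gives the fraction of missing values per column.
missing_frac = df.isnull().mean()

# Columns above the 15% threshold are candidates for dropping outright.
to_drop = missing_frac[missing_frac > 0.15].index.tolist()
print(to_drop)
```

On this toy frame only `Cabin` crosses the threshold, so it is the only column flagged.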

As for the remaining cases, the 'embarked' variable has just two missing observations. Since it is only two rows, we'll delete those observations and keep the variable.

Regarding the remaining variables, we can see there were no missing cases.

(1) We can see we have missing values in Age column

(2) We also have missing values in the Cabin and Embarked columns, although we won't be analyzing Cabin

Before dropping a whole column, the options are to drop only the specific rows with missing values, or to fill in the missing values with the mean
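The two options can be sketched side by side on a toy series standing in for the age column (values invented):

```python
import numpy as np
import pandas as pd

# Toy series with one gap, standing in for the age column.
age = pd.Series([22.0, 38.0, np.nan, 35.0])

# Option 1: drop the rows that have missing values.
dropped = age.dropna()

# Option 2: keep every row and fill the gap with the column mean.
filled = age.fillna(age.mean())

print(len(dropped), filled.iloc[2])
```

Dropping loses a row; filling keeps all four rows and replaces the gap with the mean of the observed values.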

In [6]:
#dropping the Cabin, PassengerId, Name, and Ticket columns from our dataset because they are not useful for this analysis
titanic.drop(['PassengerId','Name','Ticket','Cabin'],axis=1,inplace=True)
In [7]:
#Changing all the column names to lowercase and underscore for consistency and easy data cleaning.
titanic.rename(columns={'Survived':'survived','Pclass':'pclass','Sex':'sex','Age':'age','SibSp':'sibsp','Parch':'parch','Fare':'fare','Embarked':'embarked'},inplace=True)
In [8]:
#Checking that the changes have been applied
titanic.head()
Out[8]:
survived pclass sex age sibsp parch fare embarked
0 0 3 male 22.0 1 0 7.2500 S
1 1 1 female 38.0 1 0 71.2833 C
2 1 3 female 26.0 0 0 7.9250 S
3 1 1 female 35.0 1 0 53.1000 S
4 0 3 male 35.0 0 0 8.0500 S

TITANIC DATA DESCRIPTION

(1) passengerid = the passenger's ID

(2) survived = whether the passenger survived or not (0 - did not survive, 1 - survived)

(3) pclass = passenger ticket class (1 - high class, 2 - middle class, 3 - low class)

(4) name, sex, age = the name, gender, and age of each passenger

(5) ticket, fare, cabin, embarked = the ticket number, the fare paid, the cabin, and the port of embarkation

(6) sibsp = the number of siblings and spouses each passenger had on board

(7) parch = the number of parents and children each passenger had on board

FILLING IN MISSING VALUES

In [9]:
#checking histograms of the entire titanic dataset
titanic.hist(figsize=(10,8));

Age has a lot of missing values: 714 entries instead of 891.

Let's look at what those rows look like: if they all share the same characteristics, the null values may come from the same group of passengers, which is worth knowing.

In [10]:
#we look at dataframe where the age is null using the histogram plot
titanic[titanic.age.isnull()].hist();

These rows look broadly similar to the dataset as a whole, so we will fill in the missing ages with the mean.

In [11]:
#Filling in the missing values and rechecking the dataset info
titanic.fillna(titanic.mean(numeric_only=True), inplace=True)  # numeric_only avoids errors on the string columns
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
survived    891 non-null int64
pclass      891 non-null int64
sex         891 non-null object
age         891 non-null float64
sibsp       891 non-null int64
parch       891 non-null int64
fare        891 non-null float64
embarked    889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB

The missing ages were all filled with the mean

But the other variable (embarked) cannot be filled with a mean value because it is not numeric; it holds letters, not numbers, so it has no mean.

In [12]:
#checking the rows that are missing in embarked column
titanic[titanic.embarked.isnull()]
Out[12]:
survived pclass sex age sibsp parch fare embarked
61 1 1 female 38.0 0 0 80.0 NaN
829 1 1 female 62.0 0 0 80.0 NaN

The embarked column has only 2 missing values: 889 entries instead of 891.

Since only a small amount is missing, we can simply drop those rows.

In [13]:
#dropping the rows in embarked that are missing and rechecking the data info
titanic.dropna(inplace=True);
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 8 columns):
survived    889 non-null int64
pclass      889 non-null int64
sex         889 non-null object
age         889 non-null float64
sibsp       889 non-null int64
parch       889 non-null int64
fare        889 non-null float64
embarked    889 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 62.5+ KB

Now we have a complete dataset with no missing values

In [14]:
# Taking a look at some survival rates for babies
youngest_to_survive = titanic[titanic['survived'] == True]['age'].min()
youngest_to_die = titanic[titanic['survived'] == False]['age'].min()
oldest_to_survive = titanic[titanic['survived'] == True]['age'].max()
oldest_to_die = titanic[titanic['survived'] == False]['age'].max()

print('Youngest to survive :: ',youngest_to_survive)
print('Youngest to die :: ',youngest_to_die)
print('Oldest to survive :: ',oldest_to_survive)
print('Oldest to die :: ',oldest_to_die)
Youngest to survive ::  0.42
Youngest to die ::  1.0
Oldest to survive ::  80.0
Oldest to die ::  74.0

What is the structure of your dataset?

There are 889 passengers in the titanic dataset with 8 features (survived, pclass, sex, age, sibsp, parch, fare, and embarked). Most variables are numeric, but survived, pclass, and sex are categorical variables with the following levels.

(1) survived = whether the passengers survived or not (0-not survived, 1-survived)

(2) pclass = passenger ticket class (1-high class, 2-middle class, 3-low class)

(3) sex = female , male

What is/are the main feature(s) of interest in your dataset?

I'm most interested in figuring out what features are best for predicting the survival of passengers in the titanic dataset.

EXPLORATORY DATA ANALYSIS

(1) Univariate exploration

By looking at one variable at a time, we can build an intuition for how each variable is distributed before moving on to more complicated interactions between variables.

Let's start our exploration by looking at the distribution of age. Is the distribution skewed or symmetric? Is it unimodal or multimodal?

In [15]:
#descriptive statistics summary of passengers age
titanic['age'].describe()
Out[15]:
count    889.000000
mean      29.653446
std       12.968366
min        0.420000
25%       22.000000
50%       29.699118
75%       35.000000
max       80.000000
Name: age, dtype: float64

Observations

  • There are 889 passengers in the dataset
  • The minimum age of passengers is about five months (0.42 years × 12 ≈ 5 months)
  • The maximum age of passengers is 80 years
  • The mean age of passengers is about 29.7 years
  • The standard deviation is about 13 years
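The five-months reading of the minimum age is simple arithmetic:

```python
# 0.42 years expressed in months, confirming the "about five months" reading.
min_age_years = 0.42
min_age_months = min_age_years * 12
print(round(min_age_months, 1))
```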
In [16]:
# univariate plot of passengers age
sb.distplot(titanic['age']);
In [17]:
#descriptive statistics summary of passengers ticket fare
titanic['fare'].describe()
Out[17]:
count    889.000000
mean      32.096681
std       49.697504
min        0.000000
25%        7.895800
50%       14.454200
75%       31.000000
max      512.329200
Name: fare, dtype: float64

Observations

  • There are 889 passengers in the dataset
  • The minimum fare paid for a ticket is 0 dollars
  • The maximum fare paid for a ticket is 512.33 dollars
  • The mean fare paid for a ticket is 32.1 dollars
  • The standard deviation is about 49.7 dollars
In [18]:
# univariate plot of passengers ticket fare
sb.distplot(titanic['fare']);
In [19]:
#Histogram plot for numeric variables in the titanic dataset
fig = plt.figure(figsize = (8,8))
ax = fig.gca()
titanic.hist(ax=ax)
plt.show()

Fare - the majority of the passengers didn't pay much; the distribution is skewed to the right

Pclass - most people are in third class

Survived - more people died than survived

sibsp - most people didn't travel with siblings or spouses

parch - most people didn't travel with parents or children

age - age is also right-skewed, with the majority between 20 and 40
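The right-skew claims can be checked numerically with pandas' sample skewness. A sketch on a toy fare-like series (values invented, with a long right tail):

```python
import pandas as pd

# Toy right-skewed values standing in for the fare column.
fares = pd.Series([7.9, 8.1, 8.5, 13.0, 26.0, 71.3, 263.0])

# A positive sample skewness confirms the long right tail seen in the histogram.
print(fares.skew())
```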

BAR CHART OF QUALITATIVE VARIABLE IN THE DATASET

In [20]:
#Checking number of passengers that survived and didn't survive in the titanic dataset
pd.DataFrame(titanic.survived.value_counts())
Out[20]:
survived
0 549
1 340

We see that 549 passengers didn't survive, while 340 survived.

In [21]:
#Checking counts of gender of the passengers
pd.DataFrame(titanic.sex.value_counts())
Out[21]:
sex
male 577
female 312

We see that most passengers were male (577) versus female (312).

In [22]:
#Checking counts of class of the passengers
pd.DataFrame(titanic.pclass.value_counts())
Out[22]:
pclass
3 491
1 214
2 184

We see that most passengers are in third (low) class with 491 passengers; 214 passengers are in first class, while 184 passengers are in second class.
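Raw counts like these convert to shares in one step with `value_counts(normalize=True)`. A sketch on a toy class column (invented so that third class is the mode, as in the real data):

```python
import pandas as pd

# Toy class labels standing in for pclass; 3 (third class) is the mode.
pclass = pd.Series([3, 3, 3, 1, 1, 2])

# normalize=True returns shares instead of raw counts.
shares = pclass.value_counts(normalize=True)
print(shares.idxmax(), shares.max())
```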

In [23]:
#Checking counts of age of the passengers
age=pd.DataFrame(titanic.age.value_counts())
In [24]:
#Checking counts in our dataset with bar chart
plt.figure(figsize=(10,4))
plt.subplot(1,2,1);
titanic.survived.value_counts().plot(kind='bar',title='Survival of Passengers',color=['C0','C1']);
plt.xlabel('Passengers')
plt.ylabel('Counts');


plt.subplot(1,2,2);
titanic.pclass.value_counts().plot(kind='bar',title='Class of the passengers',color=['C4','C5','C6']);
plt.xlabel('Class')
plt.ylabel('Counts');
In [25]:
titanic.sex.value_counts().plot(kind='bar',title='Gender of the passengers',color=['C2','C3']);
plt.xlabel('Gender')
plt.ylabel('Counts');
In [26]:
age.plot(kind = "bar", figsize = (20,8))
plt.ylabel("Number of Passengers")
plt.xlabel("Age")
plt.title("Total Number of Passengers")
plt.show()
In [27]:
age_of_passengers = titanic.groupby("age").size()
In [28]:
# Bar plot of the number of passengers at each age

age_of_passengers.plot(kind = "bar", figsize = (20,8),color='red')
plt.ylabel("Number of Passengers")
plt.xlabel("Age")
plt.title("Total Number of Passengers")
plt.show()

PIE CHART FOR QUALITATIVE VARIABLE

In [29]:
plt.figure(figsize=(10,4))
plt.subplot(1,2,1);
survive_values=[549,340]
survive_labels=["No","Yes"]
plt.axis("equal")
plt.title("Pie chart of passengers who survived and didn't survive the Titanic")
plt.pie(survive_values,labels=survive_labels,radius=1.0,autopct='%0.1f%%',shadow=True,explode=[0,0.1],wedgeprops={'edgecolor':'black'});

plt.subplot(1,2,2);
gender_values=[312,577]
gender_labels=["Female","Male"]
plt.axis("equal")
plt.title("Pie chart showing the passengers gender counts")
plt.pie(gender_values,labels=gender_labels,radius=1.0,autopct='%0.1f%%',shadow=True,explode=[0,0.1],wedgeprops={'edgecolor':'black'});
In [30]:
class_values=[214,184,491]
class_labels=["high class","middle class","low class"]
plt.axis("equal")
plt.title("Pie chart showing the passengers class counts")
plt.pie(class_values,labels=class_labels,radius=1.0,autopct='%0.1f%%',shadow=True,explode=[0,0,0.1],wedgeprops={'edgecolor':'black'});

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

When investigating the age and embarked variables, we found a completeness issue: both columns contained missing values, which I had to handle.

I also noticed a consistency issue while visually assessing the variable names, so I changed them to lowercase for consistency.

(2) Bivariate Exploration

To start off with, I want to look at the pairwise correlations present between features in the data.

Through these bivariate plots, we can learn how changes in one variable relate to changes in another, and identify clusters and patterns in the dataset.

In [31]:
numeric_vars = ['age', 'fare']
categoric_vars = ['survived', 'pclass', 'sex','sibsp','parch','embarked']
In [32]:
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(titanic[numeric_vars].corr(), annot = True, fmt = '.3f',
           cmap = 'vlag_r', center = 0)
plt.show()

The relationship between survived and the predictor variables

In [33]:
#survival correlation matrix
#correlation matrix
corrmat = titanic.corr()
f, ax = plt.subplots(figsize=(12, 9))
k = 8 #number of variables for heatmap
cols = corrmat.nlargest(k, 'survived')['survived'].index
cm = np.corrcoef(titanic[cols].values.T)
sb.set(font_scale=1.25)
hm = sb.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
In [34]:
# plot matrix: checking relationship between the numerical variables in the dataset

g = sb.PairGrid(data = titanic, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter);

The scatter plot shows that the numeric variables (age and fare) are only weakly correlated with one another

In [35]:
#using heat map
#Is there a relationship between the price of tickets in dollars and the age of the passengers?
plt.hist2d(data=titanic,x='age',y='fare',cmin=0.5,cmap='viridis_r');
plt.colorbar()
plt.xlabel('age of passengers')
plt.ylabel('amount paid($)');

BOX PLOT AND VIOLIN PLOT SHOWING RELATIONSHIP BETWEEN QUANTITATIVE VARIABLE AND QUALITATIVE VARIABLE

Box plots are used to show overall patterns of response for a group. They provide a useful way to visualise the range and other characteristics of responses for a large group. They allow comparing groups of different sizes.

The unquestionable advantage of the violin plot over the box plot is that aside from showing the abovementioned statistics it also shows the entire distribution of the data. This is of interest, especially when dealing with multimodal data, i.e., a distribution with more than one peak.

I used box plots to check the relationships between qualitative and quantitative variables in the titanic dataset. Box plots do a fine job of summarizing the data, but some distributional details can get lost; these can be seen with the violin plot.
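The statistics a box plot draws are just the five-number summary, which pandas computes directly. A sketch on toy fares (values invented):

```python
import pandas as pd

# Toy fares; quantile() yields the five-number summary a box plot is built from:
# minimum, lower quartile, median, upper quartile, maximum.
fares = pd.Series([7.9, 8.1, 14.5, 31.0, 71.3])
summary = fares.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(summary)
```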

In [36]:
#Plotting a box plot for the relationship between a quantitative (fare) and a qualitative (survived) variable
base_color=sb.color_palette()[0]
sb.boxplot(data=titanic,x='survived',y='fare',color=base_color);
plt.xticks(rotation=15)
plt.title('Box plot of fare vs survived');

We can see from the box plot that passengers who survived tended to pay higher fares than those who didn't

In [37]:
#Plotting a violin plot for the relationship between a quantitative (fare) and a qualitative (survived) variable
base_color=sb.color_palette()[0]
sb.violinplot(data=titanic,x='survived',y='fare',color=base_color);
plt.xticks(rotation=15);

The violin plot gives a more detailed picture: those who didn't survive mostly paid lower fares, while those who paid more had a higher chance of survival

In [38]:
#Plotting a box plot for the relationship between a quantitative (fare) and a qualitative (pclass) variable
base_color=sb.color_palette()[0]
sb.boxplot(data=titanic,x='pclass',y='fare',color=base_color);
plt.xticks(rotation=15)
plt.title('Box plot of fare vs pclass');

We can see from the box plot that first-class passengers paid the highest fares, while third-class passengers paid the lowest

In [39]:
#Plotting a violin plot for the relationship between a quantitative (fare) and a qualitative (pclass) variable
base_color=sb.color_palette()[0]
sb.violinplot(data=titanic,x='pclass',y='fare',color=base_color);
plt.xticks(rotation=15)
plt.title('Violin plot of fare vs pclass');
In [40]:
#Plotting a box plot for the relationship between a quantitative (fare) and a qualitative (sex) variable
base_color=sb.color_palette()[0]
sb.boxplot(data=titanic,x='sex',y='fare',color=base_color);
plt.xticks(rotation=15)
plt.title('Box plot of fare vs sex');

We can see from the box plot that female passengers tended to pay higher fares than male passengers

In [41]:
#Plotting a violin plot for the relationship between a quantitative (fare) and a qualitative (sex) variable
base_color=sb.color_palette()[0]
sb.violinplot(data=titanic,x='sex',y='fare',color=base_color);
plt.xticks(rotation=15)
plt.title('Violin plot of fare vs sex');
In [42]:
#Plotting a box plot for the relationship between a quantitative (age) and a qualitative (survived) variable
base_color=sb.color_palette()[0]
sb.boxplot(data=titanic,x='survived',y='age',color=base_color);
plt.xticks(rotation=15)
plt.title('Box plot of age vs survived');
In [43]:
#Plotting a violin plot for the relationship between a quantitative (age) and a qualitative (survived) variable
base_color=sb.color_palette()[0]
sb.violinplot(data=titanic,x='survived',y='age',color=base_color);
plt.xticks(rotation=15)
plt.title('Violin plot of age vs survived');
In [44]:
#Plotting a box plot for the relationship between a quantitative (age) and a qualitative (pclass) variable
base_color=sb.color_palette()[0]
sb.boxplot(data=titanic,x='pclass',y='age',color=base_color);
plt.xticks(rotation=15)
plt.title('Box plot of age vs pclass');
In [45]:
#Plotting a violin plot for the relationship between a quantitative (age) and a qualitative (pclass) variable
base_color=sb.color_palette()[0]
sb.violinplot(data=titanic,x='pclass',y='age',color=base_color);
plt.xticks(rotation=15)
plt.title('Violin plot of age vs pclass');
In [46]:
#Plotting a box plot for the relationship between a quantitative (age) and a qualitative (sex) variable
base_color=sb.color_palette()[0]
sb.boxplot(data=titanic,x='sex',y='age',color=base_color);
plt.xticks(rotation=15)
plt.title('Box plot of sex vs age');
In [47]:
#Plotting a violin plot for the relationship between a quantitative (age) and a qualitative (sex) variable
base_color=sb.color_palette()[0]
sb.violinplot(data=titanic,x='sex',y='age',color=base_color);
plt.xticks(rotation=15)
plt.title('Violin plot of age vs sex');
In [48]:
# plot matrix of numeric features against categorical features.
# can use a larger sample since there are fewer plots and they're simpler in nature.


def boxgrid(x, y, **kwargs):
    """ Quick hack for creating box plots with seaborn's PairGrid. """
    default_color = sb.color_palette()[0]
    sb.boxplot(x=x, y=y, color=default_color)  # keyword args; positional x/y is deprecated in newer seaborn

plt.figure(figsize = [90, 90])
g = sb.PairGrid(data = titanic, y_vars = ['fare', 'age'], x_vars = categoric_vars,
                size = 3, aspect = 1.5)
g.map(boxgrid)
plt.show();
In [49]:
survive=titanic.survived==True
died=titanic.survived==False

AGE VS SURVIVED

Comparing the distribution of Age for the passengers who survived and didn't survive

In [50]:
titanic.age[survive].hist(alpha=0.5,label='survived')
titanic.age[died].hist(alpha=0.5,label='died')
plt.legend()
plt.xlabel('age')
plt.ylabel('survival counts')
plt.title('Histogram distribution of age vs survival');

It does look like very young children had a higher chance of surviving than other age groups
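This reading can be quantified by cutting age into bands and comparing survival rates. A minimal sketch on a toy frame (the rows are invented for illustration, not the real data):

```python
import pandas as pd

# Toy (age, survived) rows, invented so that children fare better.
df = pd.DataFrame({'age':      [2, 5, 9, 30, 40, 55],
                   'survived': [1, 1, 0, 0, 1, 0]})

# Cut age into child/adult bands, then compare survival rates per band.
df['band'] = pd.cut(df['age'], bins=[0, 12, 80], labels=['child', 'adult'])
rates = df.groupby('band', observed=True)['survived'].mean()
print(rates)
```

On the real data, the same two lines (with the real `titanic` frame) would put a number on the "children survive more" impression.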

CLUSTERED AND STACKED BAR CHART

Showing relationships between two categorical variables in the titanic dataset

GENDER VS SURVIVED

Comparing the gender distribution of passengers who survived and didn't survive

In [51]:
#grouping two categorical variables together(sex and survived)
bygender=titanic.groupby("sex").survived.value_counts()
bygender
Out[51]:
sex     survived
female  1           231
        0            81
male    0           468
        1           109
Name: survived, dtype: int64

From the result above, 340 out of 889 passengers survived the Titanic, a rate of 38.25%; among all passengers, surviving females (231, about 26%) outnumber surviving males (109, about 12.3%).

Meanwhile 549 out of 889 passengers did not survive, a rate of 61.75%; among the non-survivors, males (468, about 85.3%) far outnumber females (81, about 14.7%).

That is, female passengers were more likely to survive than male passengers, i.e. male passengers were more likely not to survive the Titanic than female passengers.
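The percentages above boil down to a grouped mean of the 0/1 survived flag. A sketch on toy rows (invented to mirror the direction of the real counts):

```python
import pandas as pd

# Toy (sex, survived) rows; females survive more often, as in the real data.
df = pd.DataFrame({'sex':      ['female'] * 4 + ['male'] * 6,
                   'survived': [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]})

# The mean of a 0/1 column per group is exactly the survival rate per group.
rate = df.groupby('sex')['survived'].mean()
print(rate)
```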

In [52]:
#plotting stacked bar chart of gender vs survived
bygender.unstack().plot(kind='bar',stacked=True);
plt.title('Gender vs Survived',fontsize=18)
plt.xlabel('gender',fontsize=18)
plt.ylabel('passengers',fontsize=18);

This plot shows that female passengers were more likely to survive the Titanic than male passengers (i.e. males were more likely not to survive). There are clearly more female survivors than male survivors.

In [53]:
#plotting clustered bar chart of gender vs survived
bygender=bygender.reset_index(name='count')
bygender.pivot(index='sex',columns='survived',values='count')
Out[53]:
survived 0 1
sex
female 81 231
male 468 109
In [54]:
sb.countplot(data=titanic,x='sex',hue='survived')
plt.xticks(rotation=15)
plt.title('Gender vs Survived');

This plot shows that female passengers were more likely to survive the Titanic than male passengers (i.e. males were more likely not to survive). There are clearly more female survivors than male survivors.

GENDER VS PCLASS

Comparing the distribution of passengers gender and the class of the tickets they paid for

In [55]:
#grouping two categorical variables together(sex and pclass)
byc=titanic.groupby("sex").pclass.value_counts()
byc
Out[55]:
sex     pclass
female  3         144
        1          92
        2          76
male    3         347
        1         122
        2         108
Name: pclass, dtype: int64

Note: male passengers (64.9%) outnumber female passengers (35.1%) in the titanic dataset.

  • From the result above, 214 out of 889 passengers are in first class (24.1%); first-class females (92, about 10.3%) are fewer than first-class males (122, about 13.7%).

  • 184 out of 889 passengers are in second class (20.7%); second-class females (76, about 8.5%) are fewer than second-class males (108, about 12.1%).

  • 491 out of 889 passengers are in third class (55.2%); third-class females (144, about 16.2%) are fewer than third-class males (347, about 39.0%).

In general, more passengers are in low class (55.2%) than in middle class (20.7%) or high class (24.1%).
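These within-gender class shares are what a row-normalised crosstab gives directly. A sketch on toy rows (invented, with males concentrated in third class as in the real data):

```python
import pandas as pd

# Toy (sex, pclass) rows; males lean towards third class.
df = pd.DataFrame({'sex':    ['female', 'female', 'male', 'male', 'male', 'male'],
                   'pclass': [1, 3, 1, 3, 3, 3]})

# normalize='index' makes each row (each gender) sum to 1.
mix = pd.crosstab(df['sex'], df['pclass'], normalize='index')
print(mix)
```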

In [56]:
#plotting stacked bar chart of gender vs pclass
byc.unstack().plot(kind='bar',stacked=True);
plt.title('Gender vs pclass',fontsize=18)
plt.xlabel('gender',fontsize=18)
plt.ylabel('passengers',fontsize=18);

From the plot, proportionally more of the female passengers hold first- and second-class tickets, while the male passengers are concentrated in third class

In [57]:
#plotting clustered bar chart of gender vs pclass
byc=byc.reset_index(name='count')
byc.pivot(index='sex',columns='pclass',values='count')
Out[57]:
pclass 1 2 3
sex
female 92 76 144
male 122 108 347
In [58]:
sb.countplot(data=titanic,x='sex',hue='pclass')
plt.xticks(rotation=15)
plt.title('Gender vs Pclass');

SURVIVED VS PCLASS

Comparing the distribution of passengers who survived and didn't survive and the class of the tickets they paid for

In [59]:
#grouping two categorical variables together(survived and pclass)
bys=titanic.groupby("survived").pclass.value_counts()
bys
Out[59]:
survived  pclass
0         3         372
          2          97
          1          80
1         1         134
          3         119
          2          87
Name: pclass, dtype: int64
  • From the result above, 134 out of 889 passengers in first class survived the Titanic (15.1%), while 80 first-class passengers did not survive (9.0%).

  • 87 second-class passengers survived (9.8%), while 97 did not survive (10.9%).

  • 119 third-class passengers survived (13.4%), while 372 did not survive (41.8%).

In general, first-class passengers were more likely to survive than middle- and low-class passengers: within first class, the probability of surviving exceeds the probability of not surviving.
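"More likely to survive" per class is again a grouped mean. A sketch on toy rows (invented so that first class fares best, echoing the real pattern):

```python
import pandas as pd

# Toy (pclass, survived) rows; first class survives most often.
df = pd.DataFrame({'pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
                   'survived': [1, 1, 0, 1, 0, 0, 0, 0, 1]})

# Survival rate per class, highest first.
rate = df.groupby('pclass')['survived'].mean().sort_values(ascending=False)
print(rate)
```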

In [60]:
#plotting stacked bar chart of survived vs pclass
bys.unstack().plot(kind='bar',stacked=True);
plt.title('survived vs pclass',fontsize=18)
plt.xlabel('survived',fontsize=18)
plt.ylabel('passengers',fontsize=18);
In [61]:
#plotting clustered bar chart of survived vs pclass
bys=bys.reset_index(name='count')
bys.pivot(index='survived',columns='pclass',values='count')
Out[61]:
pclass 1 2 3
survived
0 80 97 372
1 134 87 119
In [62]:
sb.countplot(data=titanic,x='survived',hue='pclass')
plt.xticks(rotation=15)
plt.title('Survived vs Pclass');

SIBSP VS SURVIVED

Checking whether having family (siblings or a spouse) on board is associated with survival

In [63]:
#grouping two variables together(survived and sibsp)
bysib=titanic.groupby("sibsp").survived.value_counts()
In [64]:
#plotting stacked bar chart of survived vs siblings/spouses
bysib.unstack().plot(kind='bar',stacked=True);
plt.title('survived vs siblings/spouses',fontsize=18)
plt.xlabel('siblings/spouse',fontsize=18)
plt.ylabel('passengers',fontsize=18);

From our results:

  • Many passengers with large families did not appear to survive
  • Passengers with one sibling/spouse survived slightly more often
  • The majority of passengers travelling alone (sibsp = 0) did not survive; most passengers fall into this category

PARCH VS SURVIVED

Checking whether having family (parents or children) on board is associated with survival

In [65]:
#grouping two variables together(survived and parch)
bypar=titanic.groupby("parch").survived.value_counts()
In [66]:
#plotting stacked bar chart of survived vs parent/children
bypar.unstack().plot(kind='bar',stacked=True);
plt.title('survived vs parents/children',fontsize=18)
plt.xlabel('parents/children',fontsize=18)
plt.ylabel('passengers',fontsize=18);

From our results:

  • Many passengers with large families did not appear to survive
  • Passengers with one parent/child survived slightly more often
  • The majority of passengers travelling alone (parch = 0) did not survive; most passengers fall into this category

In general, passengers with large families did not appear to survive well, and neither did those travelling alone
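The "alone" question from the introduction can be made explicit with a derived flag. A sketch on toy rows (invented; the flag itself is the standard `sibsp + parch == 0` construction):

```python
import pandas as pd

# Toy (sibsp, parch, survived) rows, invented for illustration.
df = pd.DataFrame({'sibsp':    [0, 0, 1, 1, 4],
                   'parch':    [0, 0, 0, 2, 1],
                   'survived': [0, 0, 1, 1, 0]})

# A passenger is alone when they have no siblings/spouse and no parents/children.
df['alone'] = (df['sibsp'] + df['parch']) == 0
rate = df.groupby('alone')['survived'].mean()
print(rate)
```

Run against the real `titanic` frame, the same flag would answer the lone-passenger question in one line.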

EMBARKED VS SURVIVED

Comparing the port where the passengers embarked (C = Cherbourg, Q = Queenstown, S = Southampton) with their chances of survival

In [67]:
#grouping two variables together(survived and embarked)
byemb=titanic.groupby("embarked").survived.value_counts()
byemb
Out[67]:
embarked  survived
C         1            93
          0            75
Q         0            47
          1            30
S         0           427
          1           217
Name: survived, dtype: int64
In [68]:
#plotting stacked bar chart of survived vs embarked
byemb.unstack().plot(kind='bar',stacked=True);
plt.title('survived vs embarked',fontsize=18)
plt.xlabel('embarked',fontsize=18)
plt.ylabel('passengers',fontsize=18);
  • We can see from our plot that passengers who embarked at S (Southampton) had the lowest survival rate
  • Passengers who embarked at C (Cherbourg) fared somewhat better
  • Passengers who embarked at Q (Queenstown) also did not fare well

In general, passengers who embarked at Cherbourg were more likely to survive than those from the other ports of embarkation. Embarked therefore seems to have some association with the chance of survival.
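This comparison is just a grouped mean of the survived flag by port. A sketch on toy embarked/survived rows (invented so that C fares best, echoing the real counts):

```python
import pandas as pd

# Toy (embarked, survived) rows; C has the highest survival rate.
df = pd.DataFrame({'embarked': ['C', 'C', 'C', 'Q', 'Q', 'S', 'S', 'S', 'S'],
                   'survived': [1, 1, 0, 0, 1, 0, 0, 1, 0]})

# Survival rate per port of embarkation; idxmax picks the best port.
rate = df.groupby('embarked')['survived'].mean()
print(rate.idxmax())
```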

In [69]:
#scatterplot
sb.set()
cols = ['fare', 'age', 'survived', 'pclass', 'parch', 'sibsp']
sb.pairplot(titanic[cols], size = 2.5)
plt.show();
In [70]:
sb.lmplot('age','survived',data=titanic);
In [71]:
sb.lmplot('age','survived',data=titanic,hue='pclass');

This plot shows that older passengers are less likely to survive.

In [72]:
sb.lmplot('age','survived',data=titanic,hue='sex');

More passengers boarded at Southampton than at Cherbourg and Queenstown, but Cherbourg passengers were more likely to survive than Southampton passengers. So there is a chance that embarked helps in prediction.

In [ ]:
 

(3) MULTIVARIATE EXPLORATION

Visualizing three or more variables

pclass vs age vs fare vs survived

In [73]:
g = sb.FacetGrid(titanic, col="pclass", hue="survived")
g.map(plt.scatter, "fare", "age", alpha=.7)
g.add_legend();
In [74]:
g = sb.FacetGrid(titanic, row="survived", col="pclass", margin_titles=True)
g.map(sb.regplot, "age", "fare", color=".3", fit_reg=False, x_jitter=.1);

parch vs age vs fare vs survived

In [75]:
g = sb.FacetGrid(titanic, row="survived", col="parch", margin_titles=True)
g.map(sb.regplot, "age", "fare", color=".3", fit_reg=False, x_jitter=.1);

sibsp vs age vs fare vs survived

In [76]:
g = sb.FacetGrid(titanic, row="survived", col="sibsp", margin_titles=True)
g.map(sb.regplot, "age", "fare", color=".3", fit_reg=False, x_jitter=.1);
In [77]:
ttype_markers=[['male','o'],['female','^']]
for ttype, marker in ttype_markers:
    plot_data=titanic.loc[titanic['sex']==ttype]
    sb.regplot(data=plot_data,x='fare',y='age',x_jitter=0.04,marker=marker,fit_reg=False);
    plt.xlabel('Passengers Fare')
    plt.ylabel('Passengers Age')
    plt.title('AGE vs FARE vs SEX')
    plt.legend(['male','female']);
    
In [148]:
#Class and gender wise segregation of passengers
sb.factorplot('survived', col='pclass', hue='sex', data=titanic, kind='count', size=7, aspect=.8)
plt.subplots_adjust(top=0.9)
In [78]:
sb.pointplot(data = titanic, x = 'survived', y = 'fare', hue = 'pclass',
             palette = 'Greens', linestyles = '', dodge = 0.4)
plt.title('passengers fare across survived and pclass')
plt.ylabel('Mean Fare ($)')
plt.yscale('log')
plt.yticks([20, 40, 60, 80, 100, 120, 140],[20, '40', '60', '80', '100', '120', '140'])


plt.show();
In [79]:
facet_grid = sb.FacetGrid(titanic, col='survived', row='pclass', size=2.2, aspect=1.6)
facet_grid.map(plt.hist, 'age', alpha=.5, bins=20)
facet_grid.add_legend();

This plot shows that passengers in a higher class were more likely to survive than passengers in a lower class

In [80]:
# Log-transforming fare (and age) because the fare distribution is skewed;
# note that zero fares produce -inf under log10, hence the RuntimeWarning
fare_log = np.log10(titanic['fare'])
age_log = np.log10(titanic['age'])
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:3: RuntimeWarning: divide by zero encountered in log10
  This is separate from the ipykernel package so we can avoid doing imports until
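The warning comes from the zero fares in the dataset, for which log10 is undefined. A common workaround (an assumption on my part, not what the cell above does) is to shift by 1 before taking the log:

```python
import numpy as np

# A few fares, including the zero fare that triggers the warning above
fares = np.array([0.0, 7.25, 71.2833])

# log10(x + 1) maps 0 to 0 instead of -inf, keeping every value finite
safe_log = np.log10(fares + 1)
```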
In [81]:
# multivariate plot of fare by age and survived
# Using a faceted scatterplot
g=sb.FacetGrid(data=titanic,col='survived',margin_titles=True,col_wrap=3)
g.map(plt.scatter,'age','fare');
In [82]:
# multivariate plot of fare by pclass and survived
# Fare by survived and pclass using a point plot

sb.pointplot(data=titanic,x='pclass',y='fare',hue='survived',ci='sd',linestyles=" ",dodge=True);
plt.xticks(rotation=15)
plt.ylabel('Average fare($)');
In [83]:
# multivariate plot of fare by pclass and survived
#Fare by survived and pclass using box plot


sb.boxplot(x='pclass',y='fare',hue='survived',data=titanic);
sb.set(style="ticks")
plt.xticks(rotation=15)
plt.ylabel('Average fare($)');
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.title("Seaborn Plot with Legend Outside")
plt.tight_layout()
plt.savefig("place_legend_outside_plot_Seaborn_boxplot.png",
                    format='png',dpi=150)

It can be seen that there is a relationship between fare, pclass and survival.

The box plot shows that people in the upper class were more likely to survive than those in the middle and lower classes.

In [84]:
#Making 3 copies of titanic data set for the three models
titanic_rf=titanic.copy()
In [85]:
titanic_lda=titanic.copy()
In [86]:
titanic_log=titanic.copy()

PREDICTION ALGORITHM

Random Forest

In [87]:
# Convert string values to int values for the ease of prediction
titanic_rf = pd.get_dummies(titanic_rf)
titanic_rf.head()
Out[87]:
survived pclass age sibsp parch fare sex_female sex_male embarked_C embarked_Q embarked_S
0 0 3 22.0 1 0 7.2500 0 1 0 0 1
1 1 1 38.0 1 0 71.2833 1 0 1 0 0
2 1 3 26.0 0 0 7.9250 1 0 0 0 1
3 1 1 35.0 1 0 53.1000 1 0 0 0 1
4 0 3 35.0 0 0 8.0500 0 1 0 0 1
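The table above shows what `pd.get_dummies` did: each string category became its own 0/1 indicator column. A tiny self-contained illustration on a toy `sex` column:

```python
import pandas as pd

# Toy column mirroring the titanic 'sex' feature
s = pd.DataFrame({"sex": ["male", "female", "female"]})

# One indicator column per category, named <column>_<category>
dummies = pd.get_dummies(s)
```

`dummies` has columns `sex_female` and `sex_male`, with a 1 marking each row's category, just like the `sex_female`/`sex_male` and `embarked_*` columns above.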

Splitting up data into 80% train data and 20% test data

In [88]:
# Split data into training and testing set
X = titanic_rf.iloc[:,1:]
Y = titanic_rf["survived"]
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size = 0.2)

(1) Random Forest Classification Algorithm - Implementation and Results

In [89]:
# Random Forest Classification Algorithm applied to train data
RF = RandomForestClassifier(n_jobs = 2)
RF.fit(X_train, Y_train)
RF.score(X_train, Y_train)
Out[89]:
0.97046413502109707
In [90]:
# Predict Random Forest Algorithm on Test Data
predictions_RF = RF.predict(X_test)
In [91]:
# Print Accuracy Score for Random Forest Algorithm
acc=accuracy_score(Y_test, predictions_RF)
print('Accuracy :: ',acc)
Accuracy ::  0.786516853933
In [92]:
# Classification Report of Prediction
print(classification_report(Y_test, predictions_RF))
             precision    recall  f1-score   support

          0       0.78      0.86      0.82       101
          1       0.79      0.69      0.74        77

avg / total       0.79      0.79      0.78       178

Here, it can be observed that about 79% of the test data is predicted correctly (precision of 78% for class 0 and 79% for class 1). Rows 0 and 1 represent "not survived" and "survived" respectively.

In [93]:
# Confusion Matrix for predictions made
confusion_matrix1 = confusion_matrix(Y_test, predictions_RF)
print(confusion_matrix1)
[[87 14]
 [24 53]]
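The per-class scores in the report can be recomputed by hand from this matrix (sklearn lays it out as [[tn, fp], [fn, tp]] for the 0/1 labels):

```python
# Entries of the confusion matrix printed above
tn, fp, fn, tp = 87, 14, 24, 53

# Precision: of everyone predicted "survived", how many actually did
precision_1 = tp / (tp + fp)

# Recall: of everyone who actually survived, how many we caught
recall_1 = tp / (tp + fn)
```

Rounding these reproduces the 0.79 precision and 0.69 recall for class 1 in the classification report.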
In [94]:
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sb.heatmap(pd.DataFrame(confusion_matrix1), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion Matrix for Random Forest Classification Algorithm', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label');
In [95]:
rand_predict = RF.predict(X_test)
rand_predict
Out[95]:
array([0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
In [96]:
Final_predictions = pd.DataFrame({'survived':rand_predict})

Final_predictions.to_csv('sample_submission.csv',index=False)
In [97]:
Final_predictions.head()
Out[97]:
survived
0 0
1 1
2 0
3 1
4 1

Feature Importance

Another great quality of random forests is that they make it very easy to measure the relative importance of each feature. Sklearn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). It computes this score automatically for each feature after training and scales the results so that the sum of all importances is equal to 1. We will access this below:

In [98]:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(RF.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False).set_index('feature')
importances.head(15)
Out[98]:
importance
feature
fare 0.271
age 0.236
sex_female 0.170
sex_male 0.115
pclass 0.083
sibsp 0.051
parch 0.042
embarked_S 0.016
embarked_C 0.012
embarked_Q 0.005
In [99]:
importances.plot.bar();

Conclusion:

The embarked categories, sibsp and parch don't play a significant role in our random forest classifier's prediction process. Because of that I will drop them from the dataset and train the classifier again. We could also remove more or fewer features, but that would need a more detailed investigation of each feature's effect on our model. For now it seems fine to remove the embarked categories, sibsp and parch.

Training Random Forest Again

In [100]:
train_df  = titanic_rf.drop("embarked_S", axis=1)
test_df  = titanic_rf.drop("embarked_S", axis=1)

train_df  = titanic_rf.drop("embarked_Q", axis=1)
test_df  = titanic_rf.drop("embarked_Q", axis=1)

train_df  = titanic_rf.drop("embarked_C", axis=1)
test_df  = titanic_rf.drop("embarked_C", axis=1)

train_df  = titanic_rf.drop("parch", axis=1)
test_df  = titanic_rf.drop("parch", axis=1)

train_df  = titanic_rf.drop("sibsp", axis=1)
test_df  = titanic_rf.drop("sibsp", axis=1)
In [101]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=100, oob_score = True)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)

acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print(round(acc_random_forest,2,), "%")
98.59 %

Our random forest model predicts as well as it did before. A general rule is that the more features you have, the more likely your model is to suffer from overfitting, and vice versa. But our data looks fine for now and doesn't have too many features.

In [102]:
# Predict Random Forest Algorithm on Test Data
predict_RF = random_forest.predict(X_test)
In [103]:
# Print Accuracy Score for Random Forest Algorithm
acc2=accuracy_score(Y_test, predict_RF)
print('Accuracy :: ',acc2)
Accuracy ::  0.786516853933
In [104]:
# Classification Report of Prediction
print(classification_report(Y_test, predict_RF))
             precision    recall  f1-score   support

          0       0.78      0.86      0.82       101
          1       0.79      0.69      0.74        77

avg / total       0.79      0.79      0.78       178

In [105]:
# Confusion Matrix for predictions made
confusion_matrix11 = confusion_matrix(Y_test, predict_RF)
print(confusion_matrix11)
[[87 14]
 [24 53]]
In [106]:
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sb.heatmap(pd.DataFrame(confusion_matrix11), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion Matrix for Random Forest Classification Algorithm', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label');
In [107]:
rand_predict2 = random_forest.predict(X_test)
rand_predict2
Out[107]:
array([0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
In [108]:
Final_predictions2 = pd.DataFrame({'survived':rand_predict2})

Final_predictions2.to_csv('sample_submission.csv',index=False)
In [109]:
Final_predictions2.head()
Out[109]:
survived
0 0
1 1
2 0
3 0
4 1

(2) Linear Discriminant Analysis - Implementation and Results

In [110]:
# Convert string values to int values for the ease of prediction
titanic_lda = pd.get_dummies(titanic_lda)
titanic_lda.head()
Out[110]:
survived pclass age sibsp parch fare sex_female sex_male embarked_C embarked_Q embarked_S
0 0 3 22.0 1 0 7.2500 0 1 0 0 1
1 1 1 38.0 1 0 71.2833 1 0 1 0 0
2 1 3 26.0 0 0 7.9250 1 0 0 0 1
3 1 1 35.0 1 0 53.1000 1 0 0 0 1
4 0 3 35.0 0 0 8.0500 0 1 0 0 1
In [111]:
##Drop one of the dummy variable for each column
titanic_lda.drop(['sex_female','embarked_C'],axis=1,inplace=True)
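Dropping one dummy per category avoids the redundancy where one column is fully determined by the others (the dummy-variable trap). As an aside, `pd.get_dummies` can do this in one step with `drop_first=True`; a toy sketch:

```python
import pandas as pd

# Toy column mirroring the titanic 'sex' feature
s = pd.DataFrame({"sex": ["male", "female", "female"]})

# drop_first=True drops the first level of each category, so only
# sex_male remains (sex_female is implied by sex_male == 0)
reduced = pd.get_dummies(s, drop_first=True)
```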
In [112]:
# Split data into training and testing set
X1 = titanic_lda.iloc[:,1:]
Y1 = titanic_lda["survived"]
X1_train, X1_test, Y1_train, Y1_test = model_selection.train_test_split(X1, Y1, test_size = 0.2)
In [113]:
# LDA applied to train data
lda = LinearDiscriminantAnalysis()
lda.fit(X1_train,Y1_train)
pred_lda = lda.predict(X1_test)
In [114]:
# Print Accuracy Score for Linear Discriminant Analysis
acco=accuracy_score(Y1_test, pred_lda)
print('Accuracy :: ',acco)
Accuracy ::  0.820224719101
In [115]:
# Classification Report of Prediction
print(classification_report(Y1_test, pred_lda))
             precision    recall  f1-score   support

          0       0.85      0.88      0.87       118
          1       0.75      0.70      0.72        60

avg / total       0.82      0.82      0.82       178

Here, it can be observed that about 82% of the test data is predicted correctly (precision of 85% for class 0 and 75% for class 1). Rows 0 and 1 represent "not survived" and "survived" respectively.

In [116]:
# Confusion Matrix for predictions made
confusion_matrix2 = confusion_matrix(Y1_test,pred_lda)
print(confusion_matrix2)
[[104  14]
 [ 18  42]]
In [117]:
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sb.heatmap(pd.DataFrame(confusion_matrix2), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion Matrix for Linear Discriminant Analysis', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label');

(3) Logistic Regression - Implementation and Results

In [118]:
titanic_log.head()
Out[118]:
survived pclass sex age sibsp parch fare embarked
0 0 3 male 22.0 1 0 7.2500 S
1 1 1 female 38.0 1 0 71.2833 C
2 1 3 female 26.0 0 0 7.9250 S
3 1 1 female 35.0 1 0 53.1000 S
4 0 3 male 35.0 0 0 8.0500 S
In [119]:
# Convert string values to int values for the ease of prediction
titanic_log = pd.get_dummies(titanic_log)
titanic_log.head()
Out[119]:
survived pclass age sibsp parch fare sex_female sex_male embarked_C embarked_Q embarked_S
0 0 3 22.0 1 0 7.2500 0 1 0 0 1
1 1 1 38.0 1 0 71.2833 1 0 1 0 0
2 1 3 26.0 0 0 7.9250 1 0 0 0 1
3 1 1 35.0 1 0 53.1000 1 0 0 0 1
4 0 3 35.0 0 0 8.0500 0 1 0 0 1
In [120]:
##Drop one of the dummy variable for each column
titanic_log.drop(['sex_female','embarked_C'],axis=1,inplace=True)
In [121]:
titanic_log.head()
Out[121]:
survived pclass age sibsp parch fare sex_male embarked_Q embarked_S
0 0 3 22.0 1 0 7.2500 1 0 1
1 1 1 38.0 1 0 71.2833 0 0 0
2 1 3 26.0 0 0 7.9250 0 0 1
3 1 1 35.0 1 0 53.1000 0 0 1
4 0 3 35.0 0 0 8.0500 1 0 1
In [122]:
X=titanic_log.drop("survived",axis=1)

y=titanic_log['survived']
In [123]:
import statsmodels.api as sm
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
Optimization terminated successfully.
         Current function value: 0.496464
         Iterations 6
                        Results: Logit
===============================================================
Model:              Logit            No. Iterations:   6.0000  
Dependent Variable: survived         Pseudo R-squared: 0.254   
Date:               2020-09-05 21:58 AIC:              898.7123
No. Observations:   889              BIC:              937.0331
Df Model:           7                Log-Likelihood:   -441.36 
Df Residuals:       881              LL-Null:          -591.41 
Converged:          1.0000           Scale:            1.0000  
---------------------------------------------------------------
                Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
---------------------------------------------------------------
pclass          0.0338   0.0859   0.3935 0.6939 -0.1346  0.2022
age             0.0041   0.0058   0.7114 0.4768 -0.0072  0.0154
sibsp          -0.2919   0.0947  -3.0813 0.0021 -0.4776 -0.1062
parch          -0.1095   0.1122  -0.9763 0.3289 -0.3295  0.1104
fare            0.0183   0.0030   6.0474 0.0000  0.0124  0.0243
sex_male       -2.2583   0.1808 -12.4914 0.0000 -2.6127 -1.9040
embarked_Q      0.2772   0.3549   0.7811 0.4347 -0.4184  0.9728
embarked_S      0.2658   0.2189   1.2144 0.2246 -0.1632  0.6949
===============================================================

In [124]:
#checking the logistic model coefficient
result.params
Out[124]:
pclass        0.033812
age           0.004112
sibsp        -0.291909
parch        -0.109541
fare          0.018328
sex_male     -2.258341
embarked_Q    0.277227
embarked_S    0.265842
dtype: float64
In [125]:
#Exponentiate the coefficients to get the odds ratios
np.exp(result.params)
Out[125]:
pclass        1.034391
age           1.004121
sibsp         0.746837
parch         0.896246
fare          1.018497
sex_male      0.104524
embarked_Q    1.319466
embarked_S    1.304529
dtype: float64
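To make the odds ratios concrete: the fitted logit coefficient for `sex_male` above is about -2.2583, and exponentiating it recovers the 0.1045 in the table, meaning a male passenger's odds of survival were roughly a tenth of a female passenger's, all else equal:

```python
import math

# Coefficient for sex_male from the fitted logit model above
coef_sex_male = -2.258341

# exp(coefficient on the log-odds scale) = odds ratio
odds_ratio = math.exp(coef_sex_male)
```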
In [126]:
#Checking if variables are significant with their pvalues
result.pvalues 
Out[126]:
pclass        6.939379e-01
age           4.768353e-01
sibsp         2.061274e-03
parch         3.289216e-01
fare          1.471874e-09
sex_male      8.322324e-36
embarked_Q    4.347221e-01
embarked_S    2.246120e-01
dtype: float64
In [127]:
# odds ratios and 95% CI
params = result.params
conf = result.conf_int()
conf['OR'] = params
conf.columns = ['2.5%', '97.5%', 'OR']
np.exp(conf)
Out[127]:
2.5% 97.5% OR
pclass 0.874070 1.224117 1.034391
age 0.992808 1.015562 1.004121
sibsp 0.620277 0.899220 0.746837
parch 0.719319 1.116691 0.896246
fare 1.012465 1.024565 1.018497
sex_male 0.073337 0.148972 0.104524
embarked_Q 0.658120 2.645403 1.319466
embarked_S 0.849400 2.003527 1.304529
In [128]:
from sklearn.linear_model import LogisticRegression
logmodel=LogisticRegression()
logmodel.fit(X_train,Y_train)
Out[128]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Predicting the test set results and calculating the accuracy

In [129]:
y_pred = logmodel.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logmodel.score(X_test, Y_test)))
Accuracy of logistic regression classifier on test set: 0.76

Confusion matrix

In [130]:
from sklearn.metrics import confusion_matrix
confusion_matrix3 = confusion_matrix(Y_test, y_pred)
print(confusion_matrix3)
[[83 18]
 [24 53]]
In [131]:
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sb.heatmap(pd.DataFrame(confusion_matrix3), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Logistic Regression Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[131]:
Text(0.5,257.44,'Predicted label')

The result is telling us that we have 83 + 53 correct predictions and 18 + 24 incorrect predictions.

Compute precision, recall, F-measure and support

In [132]:
# import the metrics class
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(Y_test, y_pred))
print("Precision:",metrics.precision_score(Y_test, y_pred))
print("Recall:",metrics.recall_score(Y_test, y_pred))
Accuracy: 0.76404494382
Precision: 0.746478873239
Recall: 0.688311688312
In [133]:
from sklearn.metrics import classification_report
print(classification_report(Y_test, y_pred))
             precision    recall  f1-score   support

          0       0.78      0.82      0.80       101
          1       0.75      0.69      0.72        77

avg / total       0.76      0.76      0.76       178

ROC Curve

In [134]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(Y_test, logmodel.predict(X_test))
fpr, tpr, thresholds = roc_curve(Y_test, logmodel.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).
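Each point on the ROC curve is a (FPR, TPR) pair at one probability threshold. A minimal sketch of computing a single point by hand, on toy scores and labels (not the model above):

```python
# Toy labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Pick one threshold; the ROC curve sweeps this value from 1 down to 0
threshold = 0.5
y_pred = [1 if s >= threshold else 0 for s in scores]

# Count the four confusion-matrix cells
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate
```

`sklearn.metrics.roc_curve` used above does exactly this across every threshold in the score list.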

Which is the best model?

In [136]:
results = pd.DataFrame({
    'Model': ['Random Forest', 'LDA', 'Logistic Regression'],'Score': [acc2, acco, metrics.accuracy_score(Y_test, y_pred)]})
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Score')
result_df.head(9)
Out[136]:
Model
Score
0.820225 LDA
0.786517 Random Forest
0.764045 Logistic Regression

As can be seen, Linear Discriminant Analysis takes first place. A cross-validation comparison would make this ranking more robust.
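The idea behind cross-validation (which `sklearn.model_selection.cross_val_score` automates) is to hold each fold out once as a test set while training on the rest, then average the scores. A minimal sketch of the fold-splitting step in plain Python:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal (train, test) pairs."""
    # Distribute any remainder across the first n % k folds
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    idx = list(range(n))
    folds, start = [], 0
    for size in sizes:
        test = idx[start:start + size]          # held-out fold
        train = idx[:start] + idx[start + size:]  # everything else
        folds.append((train, test))
        start += size
    return folds
```

Fitting the model on each `train` slice and scoring it on the matching `test` slice, then averaging, gives a single-number comparison that is less sensitive to one lucky or unlucky split.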

In [ ]:
 
In [ ]:
 

CONCLUSION

  • The results of the analysis, although tentative, would appear to indicate that class and sex, namely being a female of upper social-economic standing (first class), gave one the best chance of survival when the tragedy occurred on the Titanic. Age did not seem to be a major factor, while being a man in third class gave one the lowest chance of survival. Women and children, across all classes, tended to have a higher survival rate than men in general, but by no means did being a child or a woman guarantee survival. Overall, children accompanied by parents (or nannies) had the best survival rate, at over 50%.
  • Inferences, Observations, Conclusions for the Random Forest Classification Algorithm:

Random Forest is useful for both classification and regression!

It creates a multitude of decision trees, each built on a different random subset of the data and input variables, and returns whichever prediction was made by the most trees.

This helps to avoid overfitting, a problem that occurs when a model is so tightly fitted to arbitrary correlations in the training data that it performs poorly on test data.
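The majority-vote step described above can be illustrated in a couple of lines; each entry stands in for one hypothetical tree's prediction for a single passenger:

```python
from collections import Counter

# Predictions from five hypothetical trees for one passenger
tree_predictions = [1, 0, 1, 1, 0]

# The forest returns the most common class among the trees
forest_prediction = Counter(tree_predictions).most_common(1)[0][0]
```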

Given new data about a Titanic passenger, this model can predict whether the person would survive, with an accuracy of 78.65%.

  • Inferences, Observations, Conclusions for Linear Discriminant Analysis:

In the dataset we find that the independent variables are not normally distributed, which is a fundamental assumption of LDA.

Even so, LDA offers better accuracy and recall than Random Forest.

Given new data about a Titanic passenger, this model can predict whether the person would survive, with an accuracy of 82.02%.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  • I extended my investigation of fare against pclass and survival in this section by looking at the impact of fare and pclass on the chance of survival. The multivariate exploration showed that higher ticket fares are indeed associated with higher survival rates.

  • I also looked at fare against age and survival, examining the impact of fare and age on the chance of survival. The multivariate exploration showed that young passengers had a higher chance of surviving than old passengers.

  • During the data preprocessing part, we imputed missing values, converted features into numeric ones, grouped values into categories and created a few new features. Afterwards we trained 3 different machine learning models, picked one of them (LDA) and applied cross-validation to it. Then we discussed how LDA works, took a look at the importance it assigns to the different features and tuned its performance by optimizing its hyperparameter values. Lastly, we looked at its confusion matrix and computed the model's precision, recall and f-score.

In [ ]: